Abstract


A method is presented for the recovery of the three-dimensional translation
of a rigidly moving textured object. The novelty of the method consists of the
fact that four cameras are used in order to avoid the solution of the
correspondence problem. The method seems to be immune to small noise
percentages and to have good behavior when the noise increases.


This  work  was  supported  by  a  research  contract  from  the  U.S.  Army  Engineer 
Topographic  Laboratories  (Number  DACA  76-85-C-0001). 
1.  Introduction


An  important  problem  in  computer  vision  is  to  recover  the  three-dimensional 
motion  of  a  moving  object  from  its  images.  Up  to  now,  there  have  been  three 
approaches  towards  the  solution  of  this  problem: 

1)  The  first  assumes  the  dynamic  image  to  be  a  three-dimensional  function 
of  two  spatial  arguments  and  a  temporal  argument.  Then,  if  this 
function  is  locally  well-behaved  and  its  spatiotemporal  gradients  are 
computable,  the  image  velocity  or  optical  flow  may  be  computed  [1,  2,  3, 
5,8]. 

2)  The  second  method  for  measuring  image  motion  considers  the  cases 
where  the  motion  is  "large"  and  the  previous  technique  is  not  applicable. 
In  these  instances  the  measurement  technique  relies  upon  isolating  and 
tracking  highlights  or  feature  points  in  the  image  through  time.  In 
other  words,  operators  are  applied  on  both  dynamic  frames  which  output 
a  set  of  points  in  both  images,  and  then  the  correspondence  problem 
between  these  two  sets  of  points  has  to  be  solved  (i.e.,  finding  which 
points  on  both  dynamic  frames  are  due  to  the  projection  of  the  same 
world  point)  [9,  39, 40, 4]. 

In both of the above approaches, once the optical flow field or the discrete
displacement field (which can be sparse) has been computed, algorithms are
constructed for determining the three-dimensional motion from the
optic flow or discrete displacement values [6, 10, 11, 12, 13, 15, 16, 20, 21, 22, 23,
24, 25, 26, 28, 30, 31, 32, 33, 34, 35, 41, 42].

3) The three-dimensional motion parameters are computed directly from
the spatial and temporal derivatives of the image intensity function. In
other words, if f is the intensity function and (u, v) the optical flow at a
point, then the equation $f_x u + f_y v + f_t = 0$ holds approximately. All the
methods in this category are based on substitution of the optical flow
values in terms of the three-dimensional motion parameters in the above
equation, and there is very good work in this direction [36, 37, 17].


As the problem has been formulated over the years, one camera is used, and
so the three-dimensional motion parameters that have to be computed, and can
be computed, are five (two for the direction of translation and three for the
rotation). In our approach, four cameras are used to recover all three
translation parameters rather than only the direction of translation. Although
our theory assumes that the object in view is only translating, our results
(i.e., the three-dimensional translation) are affected very little even if the
object also undergoes a small rotation.

2.  Motivation  and  Previous  Work 

The  basic  motivation  for  this  research  is  the  fact  that  optical  flow  (or  discrete 
displacement)  fields  produced  from  real  images  by  existing  techniques  are 
corrupted by noise and are partially incorrect [7]. Most of the algorithms in the
literature  that  use  the  retinal  motion  field  to  recover  three-dimensional  motion 
fail  when  the  input  (retinal  motion)  is  noisy.  Some  algorithms  work  reasonably 
for  images  in  a  specific  domain. 

Some  researchers  [23,  31,  32,  41,  13,  33]  developed  sets  of  nonlinear 
equations  with  the  three-dimensional  motion  parameters  as  unknowns,  which 
are solved by iteration from an initial guess. These methods are very sensitive
to noise, as reported in [23, 31, 13, 33]. On the other hand, other researchers
[26, 42] developed methods that do not require the solution of nonlinear systems,
but only of linear ones. Even so, in the presence of noise, the
results are not satisfactory [26, 42].

Bruss  and  Horn  [12]  presented  a  least-squares  formalism  that  tried  to 
compute  the  motion  parameters  by  minimizing  a  measure  of  the  difference 
between  the  input  optic  flow  and  the  predicted  one  from  the  motion  parameters. 
The  method,  in  the  general  case,  results  in  solving  a  system  of  nonlinear 
equations  with  all  the  inherent  difficulties  in  such  a  task,  and  it  seems  to  have 
good  behavior  with  respect  to  noise  only  when  the  noise  in  the  optical  flow  field 
has  a  particular  distribution.  Prazdny,  Rieger,  and  Lawton  presented  methods 
based  on  the  separation  of  the  optical  flow  field  in  its  translational  and 
rotational  components,  under  different  assumptions  [21,  22].  But  difficulties  are 
reported with the approach of Prazdny in the presence of noise [34], while the
methods  of  Rieger  and  Lawton  require  the  presence  of  occluding  boundaries  in 
the  scene,  something  which  cannot  be  guaranteed.  Finally,  Ullman  in  his 
pioneering  work  [6]  presented  a  local  analysis,  but  his  approach  seems  to  be 
sensitive  to  noise,  because  of  its  local  nature. 

Several  other  authors  [20,  30]  use  the  optical  flow  field  and  its  first  and 
second  spatial  derivatives  at  corresponding  points  to  obtain  the  motion 
parameters.  But  these  derivatives  seem  to  be  unreliable  with  noise,  and  there  is 
no  known  algorithm  which  can  determine  them  reasonably  in  real  images. 
Others  [10]  follow  an  approach  based  partially  on  local  interpretation  of  the  flow 
field,  but  it  can  be  proved  [27]  that  any  local  interpretation  of  the  flow  field  is 
unstable. 

At  this  point  it  is  worth  noting  that  all  the  aforementioned  methods  assume 
an  unrestricted  motion  (translation  and  rotation).  In  the  case  of  restricted 
motion  (only  translation),  a  robust  algorithm  has  been  reported  by  Lawton  [35], 
which  was  successfully  applied  to  some  real  images.  His  method  is  based  on  a 
global  sampling  of  an  error  measure  that  corresponds  to  the  potential  position  of 
the  focus  of  expansion  (FOE);  finally,  a  local  search  is  required  to  determine  the 
exact location of the minimum value. However, the method is time-consuming,
and is likely to be very sensitive to small rotations. Also, the inherent problems
of correspondence, in the sense that there may be drop-ins or drop-outs in the
two dynamic frames, are not taken into account. All in all, most of the methods
presented up to now for the computation of three-dimensional motion depend on
the value of flow or retinal displacements. There is probably no algorithm to
date that can compute retinal motion reasonably (for example, to 10% accuracy) in
real images.

Even  if  we  had  some  way,  however,  to  compute  retinal  motion  in  a 
reasonable  (acceptable)  fashion,  i.e.,  with  at  most  an  error  of  10%,  for  example, 
all  the  algorithms  proposed  to  date  that  use  retinal  motion  as  input  would  still 
produce  non-robust  results.  It  seems  that  the  reason  for  this  is  the  fact  that  the 
motion  constraint  (i.e.,  the  relation  between  three-dimensional  motion  and 
retinal  displacements)  is  very  sensitive  to  small  perturbations.  Table  1  shows 
how  the  error  of  motion  parameters  grows  as  the  error  in  image  point 
correspondence  increases  when  8-point  correspondence  is  used,  and  Table  2 
shows the same relationship when 20-point correspondence is used with 2.5%
error  on  point  correspondences  based  on  a  recent  algorithm  of  great 
mathematical  elegance.  (Tables  1  and  2  are  from  [26].) 

Table 1: Error of motion parameters for 8-point correspondence
for 2.5% error in point correspondence.

Error of E (essential parameters)     73.91%
Error of rotation parameters          38.70%
Error of translations                 103.60%

Table 2: Error of motion parameters for 20-point correspondence
for 2.5% error in point correspondence.

Error of E (essential parameters)     19.49%
Error of rotation parameters          2.40%
Error of translations                 29.66%

It  is  clear  from  the  above  tables  that  the  sensitivity  of  the  algorithm  in  [26]  to 
small  errors  is  very  high.  It  is  worth  noting  at  this  point  that  the  algorithm  in 
[26]  is  solving  linear  equations,  but  the  sensitivity  to  error  in  point 
correspondences is not improved with respect to algorithms that solve
nonlinear equations. Finally, the third approach, which computes directly the
motion  parameters  from  the  spatiotemporal  derivatives  of  the  image  intensity 
function,  gets  rid  of  the  correspondence  problem  and  seems  very  promising.  In 
[17,  36,  15],  the  behavior  with  respect  to  noise  is  not  discussed.  But  extensive 
experiments  [38]  implementing  the  algorithms  presented  in  [37]  show  that  noise 
in  the  intensity  function  affects  the  computed  three-dimensional  motion 
parameters a great deal. We should also mention that the constraint
$f_x u + f_y v + f_t = 0$ is a very gross approximation of the actual constraint under
perspective projection [43]. So, despite the fact that no correspondences are used
in  this  approach,  the  resulting  algorithms  seem  to  have  the  same  sensitivity  to 
small  errors  in  the  input  as  in  the  previous  cases.  This  fact  should  not  be 
surprising,  because  even  if  we  avoid  correspondences,  the  constraint  between 
three-dimensional motion and retinal motion (regardless of whether the retinal
motion is expressed as optic flow or the spatiotemporal variation of the image
intensity  function)  will  be  essentially  the  same  when  one  camera  is  used 
(monocular  observer,  traditional  approach).  This  constraint  cannot  change, 
since  it  relates  three-dimensional  motion  to  two-dimensional  motion  through 
projective  geometry. 

So, as the problem has been formulated (monocular observer), it seems to
be very difficult. This is again not surprising, and the same
difficulty is encountered in many other problems in computer vision (shape from
shading, structure from motion, stereo, etc.). There has recently been an
approach to combine information from different sources in order to achieve
uniqueness and robustness of low-level visual computations [44]. With regard to
the problem of determining the three-dimensional motion parameters, why not
combine motion information with some other kind of information? It is clear
that in this case the constraints will not be the same, and there is some hope for
robustness in the computed parameters. As the other kind of information to be
combined with motion, we choose stereo.

The  need  for  combining  stereo  with  motion  has  recently  been  appreciated  by 
a  number  of  researchers  [14,  29,  45,  46].  Jenkin  and  Tsotsos  [14]  used  stereo 
information  for  the  computation  of  retinal  motion,  and  they  presented  good 
results  for  natural  images.  Waxman  et  al.  [29]  presented  a  promising  method 
for  dynamic  stereo,  which  is  based  on  the  comparison  of  image  flow  fields 
obtained  from  cameras  in  known  relative  motion,  with  passive  ranging  as  goal. 
Richards [46] combines stereo disparity with motion in order to
recover correct three-dimensional configurations from two-dimensional images
(orthography-vergence). Finally, Huang and Blostein [45] presented a method
for  three-dimensional  motion  estimation  that  is  based  on  stereo  information.  In 
their  work,  the  static  stereo  problem  as  well  as  the  three-dimensional  matching 
problem  have  to  be  solved  before  the  motion  estimation  problem.  The  emphasis 
is  placed  on  the  error  analysis,  since  the  amount  of  noise  (in  typical  image 
resolutions)  in  the  input  of  the  motion  estimation  algorithm  is  very  large. 

So  a  natural  question  arises:  is  it  possible  to  recover  three-dimensional 
motion  from  images  without  having  to  go  through  the  very  difficult 
correspondence  problem?  And  if  such  a  thing  is  possible,  how  immune  to  noise 
will the algorithm be? In this paper, we prove that if we combine stereo and
motion in some sense and we avoid any static or dynamic correspondence by
using  four  cameras,  then  we  can  compute  the  three-dimensional  translation  of  a 
moving  object.  At  this  point,  it  is  worth  noting  recent  results  by  Kanatani  [18, 
19]  that  deal  with  finding  the  three-dimensional  motion  of  planar  contours  in 
small  motion,  without  point  correspondences.  These  methods  seem  to  suffer  from 
numerical  errors  a  great  deal,  but  they  have  a  great  mathematical  elegance. 
Our  experiments  show  that  the  computation  is  very  reliable  even  in  the 
presence  of  noise,  or  even  when  the  object  in  view  is  not  only  translating  but  also 
rotating  with  a  small  rotation.  Table  3  shows  the  average  error  in  the  computed 
translational  parameters  as  the  noise  in  the  images  increases,  using  the  method 
developed  in  this  paper,  where  the  noise  was  randomly  generated. 


Table 3: Error of Translation Parameters
vs. Noise in Images

Average Error        Approximate Average Error
in Images            in Translation Parameters
1%                   negligible
5%                   negligible
10%                  5%
20%                  5%
30%                  6%
50%                  [illegible]
67%                  15%
75%                  20%
90%                  unreliable

Later  in  the  paper  we  will  formally  define  the  meaning  of  noise  and  measure  of 
the  error  in  the  computed  parameters. 


The  organization  of  this  paper  is  as  follows.  The  next  section  introduces  the 
reader  to  some  technical  prerequisites.  Section  4  describes  the  geometric  model 
and the developed constraints. Section 5 describes the algorithms, and Section 6
presents experiments and the effect of noise in the computation of
three-dimensional translation. Finally, Section 7 concludes the work and discusses
future  research. 

3.  Technical  Prerequisites 

Consider a coordinate system OXYZ fixed with respect to the camera, where
O is the nodal point of the eye and the Z-axis points along the optical axis,
so that the image plane is perpendicular to the Z-axis. Let us represent points on the
image plane with small letters (x, y) and points in the world with capital letters
(X, Y, Z). Let a point $P = (X_1, Y_1, Z_1)$ in the world have perspective image $(x_1, y_1)$,
where $x_1 = f X_1 / Z_1$ and $y_1 = f Y_1 / Z_1$. If the point P moves to $P' = (X_2, Y_2, Z_2)$
with

$$X_2 = X_1 + \Delta X$$
$$Y_2 = Y_1 + \Delta Y$$
$$Z_2 = Z_1 + \Delta Z$$

and P' has the perspective image $(x_2, y_2)$, then it can be easily shown that

$$x_2 - x_1 = \frac{f\,\Delta X - x_1\,\Delta Z}{Z_1 + \Delta Z}$$

$$y_2 - y_1 = \frac{f\,\Delta Y - y_1\,\Delta Z}{Z_1 + \Delta Z}$$

The  above  equations  relate  the  retinal  motion  of  an  image  point  with  the 
three-dimensional  motion  of  the  corresponding  world  point.  We  now  proceed 
with  the  description  of  the  imaging  system. 
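
The relation between retinal displacement and three-dimensional translation can be checked numerically. The following is a minimal sketch, not part of the original work: the function names and the numerical values of the point, the translation, and the focal length are hypothetical and chosen only for illustration. It compares the displacement obtained by projecting the point before and after the motion with the displacement predicted by the closed-form expressions.

```python
def project(point, f=1.0):
    """Perspective projection of a world point (X, Y, Z): x = fX/Z, y = fY/Z."""
    X, Y, Z = point
    return (f * X / Z, f * Y / Z)


def retinal_displacement(point, translation, f=1.0):
    """Displacement (x2 - x1, y2 - y1) predicted by the closed-form relation."""
    _, _, Z1 = point
    dX, dY, dZ = translation
    x1, y1 = project(point, f)
    return ((f * dX - x1 * dZ) / (Z1 + dZ),
            (f * dY - y1 * dZ) / (Z1 + dZ))


# Hypothetical values, for illustration only.
P = (2.0, -1.0, 10.0)
T = (0.5, 0.25, 1.0)
f = 2.0

P_moved = tuple(p + t for p, t in zip(P, T))
x1, y1 = project(P, f)
x2, y2 = project(P_moved, f)

direct = (x2 - x1, y2 - y1)              # displacement measured from the two projections
predicted = retinal_displacement(P, T, f)  # displacement given by the formula
```

The two pairs agree up to floating-point rounding, since the formula is an exact consequence of the projection equations and involves no approximation.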

4.  The  Model 

Let OXYZ be a cartesian coordinate system, fixed with the Z-axis pointing
along the optical axis, and consider the image plane $Im_1$ perpendicular to the
Z-axis at the point (0, 0, f) (focal length = f). This is obviously the model of a camera.
The geometry of the system induces a natural cartesian coordinate system on
the image plane with the center at the intersection of the Z-axis with the image
plane, and the x- and y-axes parallel to the X and Y ones. Furthermore, consider
three more cameras with image planes $Im_2$, $Im_3$, and $Im_4$ with nodal points (dx,
0, 0), (dx, dy, 0), and (0, dy, 0), respectively, such that any world point has the
same depth with respect to any of the cameras (see Figure 1).
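
The four-camera arrangement can be illustrated with a short sketch. It assumes that each image uses a coordinate system centered on its own optical axis, as the text describes for the first camera; the function name and the sample values are hypothetical. The point of the example is that cameras displaced only along X see the same y coordinate, and cameras displaced only along Y see the same x coordinate, while the differing coordinate carries the depth.

```python
def project_four(point, f, dx, dy):
    """Project a world point onto the four image planes.

    Nodal points follow the text: (0, 0, 0), (dx, 0, 0), (dx, dy, 0), (0, dy, 0).
    All cameras share the same depth Z for a given world point.
    """
    X, Y, Z = point
    centers = [(0.0, 0.0), (dx, 0.0), (dx, dy), (0.0, dy)]
    return [(f * (X - cx) / Z, f * (Y - cy) / Z) for cx, cy in centers]


# Hypothetical world point and baselines, for illustration only.
f, dx, dy = 1.0, 2.0, 1.0
point = (3.0, 4.0, 20.0)
(x1, y1), (x2, y2), (x3, y3), (x4, y4) = project_four(point, f, dx, dy)

# The horizontal disparity between the first two cameras equals f * dx / Z.
disparity_12 = x1 - x2
```

Here `x1 == x4`, `x2 == x3`, `y1 == y2`, and `y3 == y4`, and the disparity `x1 - x2` is `f * dx / Z`.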


On each one of the image planes a coordinate system is defined exactly as it
was done for $Im_1$. From now on, coordinates of three-dimensional points will be
denoted with X, Y, Z, while coordinates of points in each of the images will be
denoted by $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, $(x_4, y_4)$, respectively. Coordinates of image
points in the second dynamic frame (i.e., projections of three-dimensional points
after the motion) will be denoted by the same symbols as before the motion, but
primed (i.e., $(x_1', y_1')$, etc.). Consider a set $A = \{(X_i, Y_i, Z_i): i = 1, ..., n\}$ of points
in the world, which translates rigidly along the vector $(\Delta X, \Delta Y, \Delta Z)$ to form a
new set $A' = \{(X_i', Y_i', Z_i'): i = 1, ..., n\}$, where $X_i' = X_i + \Delta X$, $Y_i' = Y_i + \Delta Y$,
$Z_i' = Z_i + \Delta Z$, $i = 1, ..., n$. From the projections of the sets A and A' on the four
cameras we wish to recover the quantities $\Delta X$, $\Delta Y$, $\Delta Z$ without using any static
or dynamic correspondence.


Let the projections of the set A on the four image planes be
$\{(x_{1i}, y_{1i}), i = 1, ..., n\}$, $\{(x_{2i}, y_{2i}), i = 1, ..., n\}$,
$\{(x_{3i}, y_{3i}), i = 1, ..., n\}$, $\{(x_{4i}, y_{4i}), i = 1, ..., n\}$,
respectively, and the projections of the set A' be
$\{(x_{1i}', y_{1i}'), i = 1, ..., n\}$, $\{(x_{2i}', y_{2i}'), i = 1, ..., n\}$,
$\{(x_{3i}', y_{3i}'), i = 1, ..., n\}$, and $\{(x_{4i}', y_{4i}'), i = 1, ..., n\}$,
respectively. To simplify things for the reader, consider the imaging system as
shown in Figure 2.
Figure 2: Orthographic Projection of the System on the Plane YZ


We  proceed  with  the  following  propositions. 


4.1 Proposition 1: Using the aforementioned nomenclature, the quantity

$$\sum_{i=1}^{n} \frac{1}{Z_i}$$

is directly computable from the projection of the points of the set A on $Im_1$ and
$Im_2$.

Proof: Consider a point $(X_i, Y_i, Z_i) \in A$ and its projections $A_1 = (x_{1i}, y_{1i})$,
$A_2 = (x_{2i}, y_{2i})$ on $Im_1$ and $Im_2$ respectively (i.e., $A_1$ and $A_2$ are corresponding). Then

$$x_{1i} = \frac{f X_i}{Z_i} \qquad (4.1.1)$$

and

$$x_{2i} = \frac{f (X_i - dx)}{Z_i} \qquad (4.1.2)$$
Equation (4.3.1) proves Proposition 2.

4.4 Proposition 3: Using the aforementioned nomenclature, the quantity

$$\sum_{i=1}^{n} \frac{x_{1i}}{Z_i}$$

is directly computable from the projections of the set A on $Im_1$ and $Im_4$.

Proof: Similar to (4.1.3), we can derive

$$\frac{1}{Z_i} = \frac{y_{1i} - y_{4i}}{f\,dy} \qquad (4.4.1)$$

Using (4.4.1), we get

$$\sum_{i=1}^{n} \frac{x_{1i}}{Z_i} = \frac{1}{f\,dy} \sum_{i=1}^{n} x_{1i}\,(y_{1i} - y_{4i}) = \frac{1}{f\,dy} \left[ \sum_{i=1}^{n} x_{1i} y_{1i} - \sum_{i=1}^{n} x_{4i} y_{4i} \right] \qquad (4.4.2)$$

(since corresponding points in $Im_1$ and $Im_4$ have the same x coordinates).
Equation (4.4.2) proves Proposition 3.
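
The key property of these propositions is that each quantity is a difference of sums taken over one image at a time, so the order in which the points are listed in each image does not matter. The sketch below illustrates this on synthetic data. It is an illustration under stated assumptions, not the authors' implementation: the helper names, the random point cloud, and the camera parameters are hypothetical, and the points of the second and fourth images are deliberately shuffled to discard any correspondence.

```python
import random


def project(points, f, cx, cy):
    """Image points for a camera whose nodal point is (cx, cy, 0)."""
    return [(f * (X - cx) / Z, f * (Y - cy) / Z) for X, Y, Z in points]


def sum_inv_depth(im1, im2, f, dx):
    """Sum of 1/Z_i from Im1 and Im2, using only per-image sums of x."""
    return (sum(x for x, _ in im1) - sum(x for x, _ in im2)) / (f * dx)


def sum_x_over_depth(im1, im4, f, dy):
    """Sum of x_1i/Z_i from Im1 and Im4, using only per-image sums of x*y."""
    return (sum(x * y for x, y in im1) - sum(x * y for x, y in im4)) / (f * dy)


# Hypothetical scene and camera parameters, for illustration only.
rng = random.Random(0)
f, dx, dy = 1.0, 2.0, 3.0
points = [(rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(20, 40))
          for _ in range(50)]

im1 = project(points, f, 0.0, 0.0)
im2 = project(points, f, dx, 0.0)
im4 = project(points, f, 0.0, dy)
rng.shuffle(im2)  # discard static correspondence with im1
rng.shuffle(im4)

s_inv = sum_inv_depth(im1, im2, f, dx)
s_x = sum_x_over_depth(im1, im4, f, dy)

# Ground truth computed from the (normally unknown) world points.
true_inv = sum(1.0 / Z for _, _, Z in points)
true_x = sum((f * X / Z) / Z for X, _, Z in points)
```

Both estimates match the ground-truth sums up to rounding, even though the image point lists were shuffled independently.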


5.  Recovering  Three-Dimensional  Translation  Without  Correspondence 


Consider the projections of the sets A and A' on $Im_1$. Furthermore, consider a
point $(x_{1i}, y_{1i})$ and its dynamic corresponding one $(x_{1i}', y_{1i}')$. (Note that we do not
consider point correspondence, i.e., we do not worry for the moment where the
position of $(x_{1i}', y_{1i}')$ is.) From Section 3 we have:
If we write Equation (5.1) for all the pairs of corresponding points and we sum up
these  equations,  we  get 


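$$\sum_{i=1}^{n} x_{1i}' - \sum_{i=1}^{n} x_{1i} = f\,\Delta X \sum_{i=1}^{n} \frac{1}{Z_i'} - \Delta Z \sum_{i=1}^{n} \frac{x_{1i}}{Z_i'} \qquad (5.3)$$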


Assuming that the motion in depth is small with respect to the depth, Equation
(5.3) can be approximated by:
Similarly, with Equation (5.2) we obtain
If we apply the same procedure for the projections of the sets A and A' on $Im_2$ we
get two more equations. One of them is the same as (5.5), and the other is:


(5.6)


Equations (5.4) through (5.6) constitute a linear system in the unknowns $\Delta X$,
$\Delta Y$, $\Delta Z$, which always has a unique solution, given by:

Note that the denominators in the expressions (5.7) through (5.9) are always
different from zero (for dx, dy non-zero).
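
To make the overall procedure concrete, the following sketch recovers a translation from eight synthetic frames without any correspondence. It is only an illustration under explicit assumptions. The three equations are assumed to be the sums of the Section 3 relation over $Im_1$ (x and y coordinates) and $Im_2$ (x coordinate), with $\sum x/Z_i'$ replaced by $\sum x/Z_i$, and the closed-form solution in the code is derived from that assumed system rather than copied from expressions (5.7) through (5.9). All function names, the point cloud, and the parameter values are hypothetical.

```python
import random


def four_views(points, f, dx, dy, rng):
    """Project points onto Im1..Im4 and shuffle each image independently."""
    views = []
    for cx, cy in [(0.0, 0.0), (dx, 0.0), (dx, dy), (0.0, dy)]:
        view = [(f * (X - cx) / Z, f * (Y - cy) / Z) for X, Y, Z in points]
        rng.shuffle(view)  # destroys static and dynamic correspondence
        views.append(view)
    return views


def sx(im):
    return sum(x for x, _ in im)


def sy(im):
    return sum(y for _, y in im)


def sxy(im):
    return sum(x * y for x, y in im)


def recover_translation(before, after, f, dx, dy):
    """Solve the assumed 3x3 linear system using per-image sums only."""
    b1, b2, b3, b4 = before
    a1, a2 = after[0], after[1]
    inv_after = (sx(a1) - sx(a2)) / (f * dx)  # sum of 1/Z_i' (after motion)
    x1_z = (sxy(b1) - sxy(b4)) / (f * dy)     # sum of x_1i/Z_i (Im1, Im4)
    y1_z = (sxy(b1) - sxy(b2)) / (f * dx)     # sum of y_1i/Z_i (Im1, Im2)
    x2_z = (sxy(b2) - sxy(b3)) / (f * dy)     # sum of x_2i/Z_i (Im2, Im3)
    d_x1 = sx(a1) - sx(b1)
    d_y1 = sy(a1) - sy(b1)
    d_x2 = sx(a2) - sx(b2)
    # Subtracting the Im2 equation from the Im1 equation eliminates dX.
    dZ = (d_x2 - d_x1) / (x1_z - x2_z)
    dX = (d_x1 + dZ * x1_z) / (f * inv_after)
    dY = (d_y1 + dZ * y1_z) / (f * inv_after)
    return dX, dY, dZ


# Hypothetical scene: depths are large relative to the motion in depth.
rng = random.Random(1)
f, dx, dy = 1.0, 5.0, 5.0
true_t = (2.0, -1.0, 1.0)
pts = [(rng.uniform(-10, 10), rng.uniform(-10, 10), rng.uniform(50, 100))
       for _ in range(100)]
moved = [(X + true_t[0], Y + true_t[1], Z + true_t[2]) for X, Y, Z in pts]

estimate = recover_translation(four_views(pts, f, dx, dy, rng),
                               four_views(moved, f, dx, dy, rng), f, dx, dy)
```

In this sketch the denominator `x1_z - x2_z` equals $f\,dx \sum 1/Z_i^2$, which is strictly positive for non-zero dx, in line with the remark about non-vanishing denominators. The recovered depth component is slightly underestimated because of the small-motion-in-depth approximation.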

We now proceed with the experiments.

6.  Experiments, the Effect of Noise and Practical Considerations


First of all we must admit that we were expecting a small error in the
computed parameters due to the approximations made in the development of
Equations (5.4) through (5.6) (i.e., $\sum (x_{1i}/Z_i') = \sum (x_{1i}/Z_i)$, $\sum (y_{1i}/Z_i') = \sum (y_{1i}/Z_i)$,
and $\sum (x_{2i}/Z_i') = \sum (x_{2i}/Z_i)$), but experiments showed that when the motion in
depth is small with respect to the depth, this error is negligible.


In  our  experiments  we  considered  a  set  of  three-dimensional  points,  we 
projected  them  on  each  of  the  four  frames,  and  then  we  gave  the  three- 
dimensional  points  a  rigid  translation  and  we  projected  them  again  on  the  four 
frames. Discretization effects, when the three-dimensional translation is not
small, hardly affect the results. Our experiments with noise indicate that the
method seems to be immune to small noise percentages. When we say that a frame
has $\alpha$% noise, we mean that if the frame contains n points then $\alpha n/100$ of the
points are randomly generated using a random number generator. Note that in
all we have eight frames, four before the motion and four after the motion. The
noise we added was not necessarily of the same amount in all these different
frames; so when we talk about a noise of $\alpha$% we mean that the average noise
present in all the frames is $\alpha$%. On the other hand, when we say that we have
an error of $\beta$% in the translation, we mean that
where $(\Delta X, \Delta Y, \Delta Z)$ is the actual translation (with $\Delta X\,\Delta Y\,\Delta Z \neq 0$) and
$(\widehat{\Delta X}, \widehat{\Delta Y}, \widehat{\Delta Z})$ is the computed one.
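
The noise model described above can be sketched as follows. This is an assumption-laden illustration: the text does not say whether the random points replace object points or are added to them, and this sketch assumes replacement so that the frame keeps n points. The function name, the `extent` parameter bounding the random coordinates, and the sample frame are hypothetical.

```python
import random


def add_noise(frame, alpha, rng, extent=1.0):
    """Return a copy of `frame` in which round(alpha * n / 100) points are random.

    Assumption: noise points replace existing points, so the size stays n.
    Random coordinates are drawn uniformly from [-extent, extent].
    """
    n = len(frame)
    k = round(alpha * n / 100)
    noisy = list(frame)
    for idx in rng.sample(range(n), k):
        noisy[idx] = (rng.uniform(-extent, extent), rng.uniform(-extent, extent))
    return noisy


# Hypothetical frame whose points lie outside the noise range, so that
# replaced points are easy to count.
frame = [(10.0 + i, 10.0 + i) for i in range(200)]
noisy = add_noise(frame, 10, random.Random(0))
changed = sum(1 for a, b in zip(frame, noisy) if a != b)
```

With 200 points and 10% noise, 20 points are replaced and the frame size is unchanged.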


Furthermore,  if  the  set  of  three-dimensional  points  is  not  only  translating 
but  is  also  experiencing  small  rotations  (less  than  20°),  around  an  axis  passing 
through  the  center  of  gravity  of  the  points,  then  the  computed  three-dimensional 
translation  is  hardly  affected  (error  less  than  5%). 


In  a  practical  situation  (real  images),  operators  have  first  to  be  applied  on  all 
eight  frames  (four  before  the  motion  and  four  after  the  motion)  that  will  produce 
points  of  interest  [47,  48,  1,  4,  9]  in  all  images,  and  then  the  theory  developed  in 
this  paper  is  applied  to  these  points.  But  any  method  that  will  produce  points  of 
interest from the intensity images is bound to have errors due to the noise in the
images and the unpredictability of the natural scenes. So, the number of points
will  not  be  the  same  in  the  four  frames,  neither  before  nor  after  the  motion.  But 
despite  the  fact  that  our  theory  is  built  on  the  assumption  that  the  number  of 
points  is  the  same  in  all  frames,  our  experiments  show  that  even  if  the  number 
of  points  in  the  different  frames  is  not  the  same  (at  most  a  difference  of  5%),  the 
results  are  hardly  affected. 

At  this  point,  we  should  mention  that  the  equations  used  in  the  experiments 
are  modified  so  that  they  can  capture  the  difficulties  from  the  different  number 
of  points  in  the  various  frames.  In  particular,  we  do  the  following.  Equations 
(5.4),  (5.5),  and  (5.6)  are  not  affected  if  both  sides  are  divided  by  the  number  of 
points  (which  is  supposed  to  be  the  same  number).  For  example,  Equation  (5.4) 
becomes: 
If the number of points in the first frame before the motion is $n_1$ and after the
motion $n_1'$, then the above equation is written as:

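$$\frac{1}{n_1'} \sum_{i=1}^{n_1'} x_{1i}' - \frac{1}{n_1} \sum_{i=1}^{n_1} x_{1i} = f\,\Delta X\,\overline{\left(\frac{1}{Z'}\right)} - \Delta Z\,\overline{\left(\frac{x_1}{Z}\right)}$$

where each overlined quantity denotes the corresponding sum divided by the number of points from which it is computed.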

The same procedure is applied to the rest of Equations (5.5) and (5.6), as well as
for the computation of the quantities $\sum (1/Z_i)$, $\sum (1/Z_i')$, $\sum (x_{1i}/Z_i)$, and $\sum (y_{1i}/Z_i)$.
Clearly, this is an approximation, which seems very robust in extensive
experimentation.

The  table  presented  in  Section  2,  showing  the  error  in  the  translation  vs. 
noise  in  the  images,  has  been  produced  by  running  100  simulations  for  each 
noise  case  and  then  averaging  and  taking  the  ceiling  of  the  computed  errors. 
Finally,  it  is  worth  mentioning  that  our  experiments  indicate  that  discretization 
effects  hardly  affect  the  result,  provided  that  the  retinal  motion  is  large  enough 
(at  least  five  pixels). 

The appendix contains pictures from our experiments. Every picture
shows four frames before the motion and the same four frames after the motion.
The object that is imaged consists of connected points. The noise points are
randomly generated and are not connected. The pictures in the first dynamic
frame (before the motion) are shown in green, and the ones after the motion are
shown in yellow. There are three noise sources: (1) the randomly placed points; (2)
discretization; and (3) rotation. The noise percentage written captures only the
randomly  generated  points.  (When  we  write  that  the  noise  level  is,  for  example, 
10%,  we  only  mean  that  10%  of  all  the  number  of  points  in  a  frame  is  randomly 
generated;  this  noise  percentage  does  not  include  rotation  and  discretization. 
Furthermore,  when  we  write  that  the  noise  percentage  is  10%,  we  mean  that  the 
average  noise  in  all  eight  frames  is  10%,  since  the  noise  in  every  frame  is  not  the 
same,  the  maximum  difference  between  any  two  frames  being  5%.)  The  actual 
parameters and the computed ones (as well as the error in the computed
parameters as defined previously) are shown in the pictures.

7.  Conclusion  and  Future  Directions 

We  have  proposed  a  method  for  recovering  the  three-dimensional  translation 
of  a  rigidly  moving  object.  The  method  seems  to  be  very  robust  against  noise  as 
well  as  small  perturbations  of  the  retinal  points  due  to  small  rotations  of  the 
object. We have shown that the combination of stereo and motion is a promising
way of approaching the motion determination problem, as has already been
appreciated by Huang and Blostein. But we have also shown that, at least for
the case of translation, we can face the problem without having to go through
the intermediate stage of the computation of point correspondences, either
static or dynamic. Due to the special arrangement of cameras, we are not able
at this point to recover the rotation parameters in the case where the object is
translating and rotating, but as we have already stated, the method is immune
to  small  rotations.  We  are  currently  working  on  addressing  the  general  problem 
(translation  and  rotation)  without  correspondence,  as  well  as  making  a 
theoretical  analysis  of  the  error  of  the  translation  parameters  that  are  computed 
from  our  algorithm. 